Fun with Postcodes in R

Chris Newton

🐎
Postcodes

Using the inherently unexciting topic of ✨postcodes✨…

🐎
Postcodes

To talk about some useful R packages

  • {duckdb}
  • {sf}
  • {geos}

What this talk’s going to cover

  • Some limitations/quirks of postcode data
  • …but some (hopefully!) useful things you can still do with them:
  • Methods for approximate geocoding of many records, without needing an API or setting up your own geocoding server.
  • Verifying geocoding you’ve carried out via an API (where precision is needed)
  • Estimating postcode district/sector polygons (for free) and resampling statistics gathered for other geographies.

Some R packages we’ll be using:

library(readr)
library(dplyr)
library(stringr)
library(sf)
library(duckdb)
library(geos)
library(dbplyr)
library(pointblank)
library(tidygeocoder)
library(jsonlite)
library(tmap)
library(janitor)
library(tictoc)

Perils of postcodes

Postcode areas don’t map neatly onto other geographies

  • Rarely map neatly onto other geographies that official stats are available for.
  • Official boundary polygons aren’t openly available for free.
  • Postcode areas, districts and sectors can cross multiple council areas, wards, city boundaries and even countries.

Quote from ONS

We will not be producing statistics below country level because the use of interpolation for smaller areas is additionally challenging, with the quality of the resulting values liable to be lower. We therefore recommend that comparisons between local areas in different parts of the UK should be based on the respective 2021 and 2022 census data. Users should, however, be aware of any factors that may complicate direct comparability between those dates.

ONS, 10th April 2025

… but they can still be useful

  • (Relatively!) stable across time (new builds and terminations aside).
  • Most people know their postcode district and area.
  • For better or worse, a lot of data is collected at postcode level.
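A quick reminder of the hierarchy those bullets rely on: a unit postcode like “BS1 4ST” nests inside sector “BS1 4”, district “BS1” and area “BS”. A minimal base-R sketch of that split (the parsing here is a simplification; real postcodes come in several valid formats):

```r
# Split a unit postcode into the coarser postcode geographies.
# NB: simplified parsing -- real postcodes have several valid formats.
postcode_parts <- function(pcds) {
  outward <- sub(" .*$", "", pcds)    # outward code, e.g. "BS1"
  inward  <- sub("^[^ ]+ ", "", pcds) # inward code,  e.g. "4ST"
  list(
    area     = sub("[0-9].*$", "", outward),        # "BS"
    district = outward,                             # "BS1"
    sector   = paste(outward, substr(inward, 1, 1)) # "BS1 4"
  )
}

postcode_parts("BS1 4ST")
```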

Geocoding based on postcode centroids

  • Parse postcodes from the address string (if not held in their own column)
  • Left-join a postcode lookup (via duckdb?)
  • Transform to an sf object via the coordinate columns
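The first step can be done with a regular expression; here is a base-R sketch (`stringr::str_extract()` with the same pattern is the tidyverse equivalent). The pattern is a common simplification, not the full official postcode specification:

```r
# Pull the first thing that looks like a UK postcode out of an address string.
# Simplified pattern: 1-2 letters, a digit, an optional digit/letter,
# an optional space, then a digit and two letters.
extract_postcode <- function(address) {
  x <- toupper(address)
  m <- regexpr("[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}", x)
  out <- rep(NA_character_, length(x))
  out[m != -1L] <- regmatches(x, m)  # NA where no match was found
  out
}

extract_postcode("Greggs, 85 New Street, Birmingham B2 4DU")
```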

Verifying geocoding

Download some Greggs data

Shout out to Alasdair Rae’s Greggs spider map… https://alasdairrae.github.io/steakbakespider/

greggs_data <- jsonlite::fromJSON(
  "https://production-digital.greggs.co.uk/api/v1.0/shops",
  flatten = TRUE
  ) |> 
  janitor::clean_names() |> 
  dplyr::select(
    shop_name, 
    starts_with("address"), 
    -c(address_phone_number, address_country)
  )

What it looks like

as_tibble(greggs_data)
# A tibble: 2,552 × 7
   shop_name             address_house_number…¹ address_street_name address_city
   <chr>                 <chr>                  <chr>               <chr>       
 1 Bargoed, U5 The Plat… "U5"                   The Plateau         Bargoed     
 2 Rotherham, 2 Effingh… ""                     2/6 Effingham Stre… Rotherham   
 3 Nottingham, U32 Vict… "U32 Victoria Centre"  Lower Parliament S… Nottingham  
 4 Birmingham, 85 New St ""                     85 New Street       Birmingham  
 5 Daventry, U20 Bowen … "U20"                  Bowen Square        Daventry    
 6 Birmingham, U38 Grea… ""                     37/38 Great Wester… Birmingham  
 7 Bletchley, 108 Queen… ""                     108 Queensway       Bletchley   
 8 Rushden, 40 High St   ""                     40 High Street      Rushden     
 9 Outlet: Leicester, 2… ""                     237C Uppingham Road Leicester   
10 Llanharan, U3 Bridge… "U3"                   Bridgend Road       Llanharan   
# ℹ 2,542 more rows
# ℹ abbreviated name: ¹​address_house_number_or_name
# ℹ 3 more variables: address_post_code <chr>, address_longitude <dbl>,
#   address_latitude <dbl>

Transform to sf

greggs_sf <- greggs_data |> 
  st_as_sf(
    coords = c("address_longitude", "address_latitude"),
    crs = 4326 # specify CRS if known
  )

Mapping Greggs data

tmap_mode("view")
tm_shape(greggs_sf) +
  tm_markers(clustering = TRUE)

But what if we didn’t have the coordinates?

A table with addresses

greggs_table <- greggs_data |> 
  select(-c("address_longitude", "address_latitude"))

as_tibble(greggs_table)
# A tibble: 2,552 × 5
   shop_name             address_house_number…¹ address_street_name address_city
   <chr>                 <chr>                  <chr>               <chr>       
 1 Bargoed, U5 The Plat… "U5"                   The Plateau         Bargoed     
 2 Rotherham, 2 Effingh… ""                     2/6 Effingham Stre… Rotherham   
 3 Nottingham, U32 Vict… "U32 Victoria Centre"  Lower Parliament S… Nottingham  
 4 Birmingham, 85 New St ""                     85 New Street       Birmingham  
 5 Daventry, U20 Bowen … "U20"                  Bowen Square        Daventry    
 6 Birmingham, U38 Grea… ""                     37/38 Great Wester… Birmingham  
 7 Bletchley, 108 Queen… ""                     108 Queensway       Bletchley   
 8 Rushden, 40 High St   ""                     40 High Street      Rushden     
 9 Outlet: Leicester, 2… ""                     237C Uppingham Road Leicester   
10 Llanharan, U3 Bridge… "U3"                   Bridgend Road       Llanharan   
# ℹ 2,542 more rows
# ℹ abbreviated name: ¹​address_house_number_or_name
# ℹ 1 more variable: address_post_code <chr>

Download postcode centroids from the ONS Geoportal

ONS Geoportal (2021)

1. Joining coords via {readr} & {dplyr}

tic()
greggs_coords <- greggs_table |>
  mutate(
    pcds = str_remove_all(address_post_code, " ")
  ) |>
  left_join(
    read_csv("data/NSPL_FEB_2025_UK.csv",
             show_col_types = FALSE) |> 
      select(pcds, long, lat) |> 
      mutate(pcds = str_remove_all(pcds, " ")),
    by = "pcds"
  )
toc()
18.53 sec elapsed

2. Joining coords via {duckdb} & {dbplyr}

tic()
# Create duckdb connection
con <- dbConnect(duckdb())
# Register Greggs data to duckdb, so we can use it in the duckdb pipeline
duckdb_register(con, "greggs_table", greggs_table)
# Carry out the join within duckdb
greggs_coords <- tbl(con, "greggs_table") |>
  mutate(pcds = str_remove_all(address_post_code, " ")) |> 
  left_join(
    tbl(con, "data/NSPL_FEB_2025_UK.csv") |>
      select(pcds, long, lat) |> 
      mutate(pcds = str_remove_all(pcds, " ")),
    by = "pcds"
  ) |> 
  collect()
toc()
1.75 sec elapsed

~7x faster in this example; the benefit grows with larger datasets.

Limitations of this dataset/approach

# Pointblank can be useful for reporting (doesn't render nicely on slides though!)
pointblank::create_agent(tbl = greggs_coords) |> 
  pointblank::col_vals_not_null(
    columns = c(long, lat)
  ) |> 
  pointblank::interrogate()
Pointblank Validation
[2025-05-21|16:14:33]

tibble greggs_coords

STEP                    COLUMN  UNITS  PASS         FAIL
1  col_vals_not_null()  long    2552   2551 (0.99)  1 (0.01)
2  col_vals_not_null()  lat     2552   2551 (0.99)  1 (0.01)
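The failing units correspond to a shop whose postcode found no match in the lookup. The quickest follow-up is to filter for missing coordinates; on the real table that's `filter(greggs_coords, is.na(long) | is.na(lat))`. A toy sketch of the idea, with made-up data:

```r
# Toy stand-in for greggs_coords: one row got no centroid from the join.
greggs_coords_demo <- data.frame(
  shop_name = c("Bargoed", "Rushden", "Mystery Shop"),
  pcds      = c("CF818FD", "NN100PA", "ZZ999ZZ"),
  long      = c(-3.23, -0.60, NA),
  lat       = c(51.69, 52.29, NA)
)

# Rows that would fail the col_vals_not_null() check:
failed <- greggs_coords_demo[is.na(greggs_coords_demo$long) |
                             is.na(greggs_coords_demo$lat), ]
failed$shop_name
```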

How did the postcode centroids compare with the API locations?

greggs_coords_sf <- greggs_coords |> 
  st_as_sf(coords = c("long","lat"), # specify X and Y columns 
           na.fail = FALSE,
           crs = 4326) |> 
  st_transform(27700) |> 
  arrange(shop_name)

distance_from_API_location <- st_distance(
  greggs_sf |> st_transform(27700) |> arrange(shop_name), 
  greggs_coords_sf,
  by_element = TRUE
  )

# add distances to both tables for easier filtering when checking geocodes look reasonable:
greggs_coords_sf$m_from_true_location <- round(
  as.numeric(distance_from_API_location), 2
  )

top5_discrepancies <- greggs_coords_sf |> 
  select(shop_name, address_street_name, address_post_code, m_from_true_location) |> 
  arrange(desc(m_from_true_location)) |> 
  slice_head(n = 5)
top5_discrepancies
Simple feature collection with 5 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 140498.4 ymin: 279437.9 xmax: 453506 ymax: 695112.3
Projected CRS: OSGB36 / British National Grid
# A tibble: 5 × 5
  shop_name           address_street_name address_post_code m_from_true_location
  <chr>               <chr>               <chr>                            <dbl>
1 EG On The Move Gib… Watling Street      LE17 6AR                        22503.
2 ASDA EXPRESS Kimme… 55 Expressway       LL18 5XE                        14220.
3 BLAKEMORE Phoenix … 205 Nottingham Road NG16 1AE                         4101.
4 ASDA EXPRESS Kilde… Kildean Business P… FK9 4UA                          3696.
5 APPLEGREEN M1 Nort… Tullyacross         BT27 5SG                         2603.
# ℹ 1 more variable: geometry <POINT [m]>

Let’s take a look…

tm_shape(top5_discrepancies) + 
  tm_dots(col = "red") +
tm_shape(greggs_sf |> filter(shop_name %in% top5_discrepancies$shop_name)) + 
  tm_dots(col = "blue")

Generating postcode boundaries from centroids

Voronoi polygons via {sf} 1

# Read in UK boundary geometry -------------------------------------------------
bounds <- read_sf("data/Countries_December_2024_Boundaries_UK_BGC.gpkg") |> 
  st_union() |> 
  st_as_sf()

# Read in centroids geopackage--------------------------------------------------
centroid_path <- "data/NSPL_Online_Latest_Centroids_2025-03-12.gpkg"

Voronoi polygons via {sf} 2

tic()
centroids <- sf::read_sf(
  centroid_path, 
  # Exclude terminated postcodes (whether they have a grid ref or not),
  # those with no grid ref, and 'Large user' postcodes (business/PO box ones?)
  query = "SELECT PCDS, LONG, LAT, SHAPE FROM NSPL_LATEST_UK
           WHERE OSGRDIND NOT IN (8, 9)
           AND USERTYPE = 0
           AND DOTERM IS NULL"
  ) |> 
  clean_names() |> 
  mutate(pcds_district = word(pcds, 1) |> str_squish())

sf_voronoi <- centroids |> 
  # Create voronoi polygons for each postcode via sf -----------------------------
  st_geometry() |> # to get sfc from sf
  st_union() |>    # to get a sfc of MULTIPOINT type
  sf::st_voronoi(envelope = st_geometry(bounds)) |> 
  st_collection_extract(type = "POLYGON") |> # a list of polygons
  st_sf() |> 
  st_join(centroids) # add attribute cols back in

# rename geometry column 
# (otherwise it'll be named after the functions that produced it.)
st_geometry(sf_voronoi) <- "geometry"
toc()

Quick plot

sf_voronoi |> 
  filter(pcds_district == "BS1") |> 
  tm_shape() + 
  tm_borders()

…but if you try to aggregate the results in {sf}…

postcode_districts <- sf_voronoi |> 
  group_by(pcds_district) |> 
  summarise(geometry = st_union(geometry))

…you’ll be waiting a long time!

Using {geos} instead

{geos} provides access to the GEOS C library through R.

geos_bounds <- bounds |> as_geos_geometry()
message("Voronoi union via {geos}:")
tic()
geos_voronoi <- centroids |> 
  as_geos_geometry() |> 
  geos_make_collection() |> 
  geos_voronoi_polygons(
    env = geos_bounds
  ) |> 
  st_as_sf() |> 
  st_collection_extract(type = "POLYGON") |> 
  # get attributes back
  st_join(centroids) |> 
  as_tibble() |> 
  group_by(pcds_district) |>
  summarise(
    geometry = geometry |>
      geos_make_collection() |>
      geos_unary_union()
  ) |> 
  # convert from tibble back to sf object
  st_as_sf() |> 
  # trim to UK bounds or some borders extend out to sea
  st_intersection(bounds)
toc() 
# ~240 seconds

Results

Compared with Google Maps:

geos_voronoi_bristol <- geos_voronoi |> 
  filter(str_detect(pcds_district, regex("^BS\\d$")))

tm_shape(geos_voronoi_bristol) +
  tm_borders() +
  tm_text(text = "pcds_district")

Interpolating official statistics to alternate geographies

Read in some census 2021 data

bristol_pop_table <- read_tsv("data/census_2021_Bristol_OA_population.tsv") |> 
  clean_names() |> 
  select(OA21CD = geogcode, population = value)

Join it to the OA boundaries

bristol_pop_oa21 <- read_sf(
  "data/Output_Areas_2021_EW_BGC_V2.gpkg",
  query = "SELECT OA21CD, SHAPE FROM OA_2021_EW_BGC_V2"
  ) |> 
  st_filter(geos_voronoi_bristol, .predicate = st_intersects) |> 
  inner_join(bristol_pop_table, by = "OA21CD")

tm_shape(bristol_pop_oa21) + 
  tm_fill(col = "population", style = "fisher") +
tm_shape(geos_voronoi_bristol) +
  tm_borders() + 
  tm_text("pcds_district")

Interpolation via sf::st_interpolate_aw()

postcode_district_pop <- sf::st_interpolate_aw(
  bristol_pop_oa21['population'], 
  geos_voronoi_bristol,
  extensive = TRUE
  ) |> 
  st_join(geos_voronoi_bristol |> st_centroid())
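With `extensive = TRUE`, `st_interpolate_aw()` apportions each source value by the share of the source polygon’s area falling inside each target, so totals are conserved. The underlying arithmetic, on made-up numbers:

```r
# One output area (population 300) overlapping two postcode districts:
# 3/4 of its area falls in district A, 1/4 in district B.
oa_population <- 300
area_share    <- c(A = 0.75, B = 0.25)

# Extensive (count-like) interpolation: apportion by area share.
district_pop <- oa_population * area_share  # A = 225, B = 75

# The total is conserved across the target polygons:
sum(district_pop) == oa_population
```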

Let’s take a look

tm_shape(postcode_district_pop) + 
  tm_polygons(
    title = "Population estimate",
    col = "population", 
    style = "fisher"
  ) +
  tm_text("pcds_district")

Thanks!

References/inspiration

Postcode Products - ONS, 2016

https://www.ons.gov.uk/methodology/geography/geographicalproducts/postcodeproducts

https://geoportal.statistics.gov.uk/documents/8093d03408f04240a2f11a9d8913a45e/explore

https://www.ons.gov.uk/methodology/geography/ukgeographies/postalgeography#:~:text=Limitations%20of%20using%20postcodes%20as,required%20to%20avoid%20data%20misallocation.

Parry, Josiah. 2023. “R-Spatial Beyond Sf.” September 8, 2023. https://geocompx.org/post/2023/beyond-sf/

https://longair.net/blog/2021/08/23/open-data-gb-postcode-unit-boundaries/

https://grantmcdermott.com/posts/fast-geospatial-datatable-geos/

https://r-spatial.github.io/sf/reference/interpolate_aw.html

https://alasdairrae.github.io/steakbakespider/